Enhancing throughput of the Hadoop Distributed File System for interaction-intensive tasks
Authors
Abstract
The Hadoop Distributed File System (HDFS) is designed to run on commodity hardware and can be used as a stand-alone, general-purpose distributed file system (HDFS user guide, 2008). It provides access to bulk data with high I/O throughput, which makes it well suited to applications with large I/O data sets. However, the performance of HDFS decreases dramatically when handling operations on interaction-intensive files, i.e., files that are relatively small but frequently accessed. This paper analyzes the cause of the throughput degradation when accessing interaction-intensive files and presents an enhanced HDFS architecture, along with an associated storage allocation algorithm, that overcomes the problem. Experiments show that with the proposed architecture and storage allocation algorithm, HDFS throughput for interaction-intensive files increases by 300% on average, with only a negligible performance decrease for large-data-set tasks. © 2014 Elsevier Inc. All rights reserved.
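The abstract does not spell out the allocation algorithm itself, so the following is only a minimal sketch of the general idea: classify a file as interaction-intensive when it is both small and hot, and route it to a dedicated high-throughput tier rather than the regular bulk-storage nodes. The class name, thresholds, and access-count hook below are all illustrative assumptions, not the paper's actual parameters or code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative sketch (not the paper's algorithm): files that are small
 * but frequently accessed are routed to a dedicated "interaction" tier;
 * everything else goes to the regular bulk-data tier.
 */
public class InteractionAwareAllocator {

    // Hypothetical thresholds, chosen only for illustration.
    private static final long SMALL_FILE_BYTES = 4L * 1024 * 1024; // 4 MiB
    private static final long HOT_ACCESS_COUNT = 100;              // per window

    // Per-file access counters, assumed to be fed by a NameNode-side hook.
    private final Map<String, Long> accessCounts = new ConcurrentHashMap<>();

    public void recordAccess(String path) {
        accessCounts.merge(path, 1L, Long::sum);
    }

    /** Decide which storage tier should hold the file's blocks. */
    public Tier chooseTier(String path, long fileSizeBytes) {
        long accesses = accessCounts.getOrDefault(path, 0L);
        boolean interactionIntensive =
                fileSizeBytes <= SMALL_FILE_BYTES && accesses >= HOT_ACCESS_COUNT;
        return interactionIntensive ? Tier.INTERACTION : Tier.BULK;
    }

    public enum Tier { INTERACTION, BULK }
}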
Similar resources
Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
The Hadoop MapReduce framework is an important distributed processing model for large-scale, data-intensive applications. The current Hadoop and the existing Hadoop Distributed File System rack-aware data placement strategy for MapReduce in a homogeneous Hadoop cluster assume that each node in the cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...
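One common way to adapt placement to a heterogeneous cluster, sketched below purely as an assumption-laden illustration (not the cited paper's algorithm), is to make a node's chance of receiving the next block proportional to its capacity rather than uniform.

import java.util.List;
import java.util.Random;

/**
 * Capacity-proportional block placement sketch for a heterogeneous
 * cluster. Node names and capacity scores are illustrative inputs.
 */
public class CapacityAwarePlacement {
    public record Node(String name, double capacity) {}

    private final Random rng = new Random();

    /** Pick a node with probability proportional to its capacity. */
    public Node pickNode(List<Node> nodes) {
        double total = nodes.stream().mapToDouble(Node::capacity).sum();
        double r = rng.nextDouble() * total;
        for (Node n : nodes) {
            r -= n.capacity();
            if (r <= 0) {
                return n;
            }
        }
        return nodes.get(nodes.size() - 1); // numerical fallback
    }
}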
HADOOP: A Framework for Distributed Computing
With data growing so rapidly, and unstructured data accounting for about 90% of data today, the time has come for enterprises to re-evaluate their approach to data storage, management, and analysis. This enormously growing data has been given the name Big Data. The Hadoop platform has been designed to tackle the problems associated with handling such enormous data-that doe...
A Relative Study on Task Schedulers in Hadoop MapReduce
Hadoop is a framework for Big Data processing in distributed applications. A Hadoop cluster is built for running data-intensive distributed applications. The Hadoop Distributed File System is the primary storage area for Big Data. MapReduce is a model for aggregating the tasks of a job. Task assignment is handled by schedulers, which guarantee a fair allocation of resources among users. When a user su...
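To make the fair-allocation idea concrete, here is a toy sketch: the next free task slot goes to the pool that currently holds the fewest running tasks. This captures the spirit of fair sharing but is an illustrative simplification, not Hadoop's actual Fair Scheduler implementation.

import java.util.HashMap;
import java.util.Map;

/**
 * Toy fair-sharing sketch: always assign the next slot to the pool
 * with the fewest running tasks. Pool bookkeeping is illustrative.
 */
public class FairPoolPicker {
    private final Map<String, Integer> runningTasks = new HashMap<>();

    public void register(String pool) {
        runningTasks.putIfAbsent(pool, 0);
    }

    /** Give the next free slot to the least-loaded pool. */
    public String assignNextSlot() {
        String pool = runningTasks.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow();
        runningTasks.merge(pool, 1, Integer::sum);
        return pool;
    }

    public void taskFinished(String pool) {
        runningTasks.merge(pool, -1, Integer::sum);
    }
}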
Software Design and Implementation for MapReduce across Distributed Data Centers
Recently, the computational requirements for large-scale, data-intensive analysis of scientific data have grown significantly. In High Energy Physics (HEP), for example, the Large Hadron Collider (LHC) produced 13 petabytes of data in 2010. This huge amount of data is processed at more than 140 computing centers distributed across 34 countries. The MapReduce paradigm has emerged as a highly succ...
Achieving Load Balancing of HDFS Clusters Using Markov Model
The combination of Hadoop and HDFS is becoming a de facto standard system for handling big data. HDFS is a distributed file system designed for big data. In HDFS, a file consists of multiple large blocks. The central management of HDFS tries to scatter these blocks across different nodes to maximize I/O throughput. Hadoop is a framework that supports data-intensive parallel a...
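The Markov-model policy itself is not reproduced in the blurb; as a minimal sketch of the block-scattering idea it describes, the greedy routine below sends each block of a file to the node currently holding the fewest blocks, spreading the file across the cluster so reads can proceed from several nodes in parallel. The load bookkeeping is an illustrative assumption.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Greedy block-scattering sketch: each block goes to the node that
 * currently stores the fewest blocks. Not the cited paper's method.
 */
public class BlockScatterer {
    /** blocksUsed maps node name to the number of blocks already stored. */
    public List<String> placeBlocks(Map<String, Integer> blocksUsed, int blockCount) {
        List<String> placement = new ArrayList<>();
        for (int i = 0; i < blockCount; i++) {
            String target = blocksUsed.entrySet().stream()
                    .min(Map.Entry.comparingByValue())
                    .orElseThrow()
                    .getKey();
            blocksUsed.merge(target, 1, Integer::sum);
            placement.add(target);
        }
        return placement;
    }
}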
Journal: J. Parallel Distrib. Comput.
Volume: 74
Issue: -
Pages: -
Publication date: 2014